Abstract

In highly repetitive strings, like collections of genomes from the same species, distinct measures of 
repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically 
depend only on one of these measures. We describe two data structures whose size depends on multiple 
measures of repetition at once, and that provide competitive tradeoffs between the time for counting and 
reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component 
of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to 
the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it 
with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, 
and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the 
number of extensions of maximal repeats. The combination of CDAWG and RLBWT enables also a 
new representation of the suffix tree, whose size depends again on the number of extensions of maximal 
repeats, and that is powerful enough to support matching statistics and constant-space traversal. 


1 Introduction 

The space taken by compressed data structures for highly-repetitive strings is typically a function of a specific
measure of repetition, for example the number $z$ of factors in a Lempel-Ziv parsing [?], or the number
$r$ of runs in a Burrows-Wheeler transform [14]. For many such compressed data structures, computing all
the occurrences of a pattern in the indexed string is a bottleneck. In this paper we explore the advantages
of combining data structures that depend on distinct measures of repetition. Specifically, we describe a data
structure that takes approximately $O(z + r)$ words of space, and that reports all the occurrences of a pattern
of length $m$ in $O(m(\log\log n + \log z) + \mathrm{pocc} \log^{\epsilon} z + \mathrm{socc} \log\log n)$ time, where $n$ is the
length of the string and pocc and socc are the number of primary and of secondary occurrences, respectively
(see Section 2.2 for definitions). This compares favorably to the $O(m^2 h + (m + \mathrm{occ}) \log z)$ reporting time
of LZ77 indexes [11], where $h$ is the height of the parse tree. It also compares favorably in space to solutions
based on run-length encoded BWT (RLBWT) and suffix array samples [14], which take $O(n/k + r)$ words of
space to achieve $O(m \log\log n + k \cdot \mathrm{occ} \log\log n)$ reporting time, where $k$ is a sampling rate.

We also introduce a new measure of the repetitiveness of a string, the number $e$ of right extensions
of maximal repeats, which is related to the number of arcs in the compact directed acyclic word graph
(CDAWG) and which is an upper bound on $r$ and $z$. We show a data structure whose size depends on $e$ and
that reports all the occ occurrences of a pattern of length $m$ in a string of length $n$ in
$O(m \log\log n + \mathrm{occ})$ time. The main component of our constructions is the RLBWT, which we use to
count the number of occurrences of a pattern, and which we combine with the CDAWG and with data structures
from LZ indexes, rather than with suffix array samples, for reporting. Similar combinations have already
appeared in the literature, but their space has been related to statistical compressibility rather than to the
number of repetitions: for example, an FM-index has already been combined with an LZ78 self-index to achieve
faster search or reporting [?], but the size of the resulting data structure depends on the $k$-th order
empirical entropy.

*This work was partially supported by Academy of Finland under grant 250345 (Center of Excellence in Cancer Genetics
Research).

Figure 1: Growth of the number of maximal repeats $|\mathcal{M}_T|$ (black circles), of
$|\mathcal{E}^r_T \cup \mathcal{F}^r_T|$ (white circles, $e$ in the introduction), of the number of runs in BWT
$|\mathcal{R}_T|$ (squares, $r$ in the introduction), and of $|\mathcal{Z}_T|$ (triangles, $z$ in the introduction)
in a concatenation $T$ of 39 highly similar Saccharomyces cerevisiae genomes [8] (see Section 2 for definitions).
Left: growth inside the first genome of the database. Center: growth after the addition of each genome (one
sample per genome). Right: the same as the plot in the center, but with each curve normalized by its first
sample. The symmetrical measures on the reverse of $T$ are not shown, since they behave approximately as their
counterparts.

Combining the RLBWT with the CDAWG also enables a new representation of the suffix tree, which
takes space proportional to $e + \overline{e}$ (where $\overline{e}$ is the number of left extensions of maximal
repeats) and which supports a number of operations in $O(\log\log n)$ time. Among other properties, this new
representation allows computing the matching statistics of a pattern of length $m$ in $O(m \log\log n)$ time.
Our constructions are targeted to highly-repetitive strings, like large databases of similar genomes, in which
all the measures of repetition on which our data structures depend grow sublinearly in the size of the database
(see Figure 1 for an example).


2 Preliminaries 

Let $\Sigma = [1..\sigma]$ be an integer alphabet, let $\# = 0 \notin \Sigma$ be a separator, and let
$T \in [1..\sigma]^{n-1}\#$ be a string. We denote the reverse of $T$ by $\overline{T}$. Given a substring $W$ of
$T$, let $\mathcal{P}_T(W)$ be the set of all starting positions of $W$ in the circular version of $T$. A repeat
$W$ is a string that satisfies $|\mathcal{P}_T(W)| > 1$. We denote by $\Sigma^{\ell}_T(W)$ the set of characters
$\{a \in [0..\sigma] : |\mathcal{P}_T(aW)| > 0\}$ and by $\Sigma^{r}_T(W)$ the set of characters
$\{b \in [0..\sigma] : |\mathcal{P}_T(Wb)| > 0\}$. A repeat $W$ is right-maximal (respectively, left-maximal) iff
$|\Sigma^{r}_T(W)| > 1$ (respectively, iff $|\Sigma^{\ell}_T(W)| > 1$). It is well known that $T$ can have at most
$n - 1$ right-maximal substrings and at most $n - 1$ left-maximal substrings. A maximal repeat of $T$ is a repeat
that is both left- and right-maximal: we call $\mathcal{M}_T$ the set of all maximal repeats of $T$. A maximal
repeat $W$ can be seen as a set of right-maximal substrings of $T$, and specifically as the set of all
right-maximal strings $W[i..|W|]$ for $i \in [1..k]$, such that $W[i..|W|]$ is not left-maximal for
$i \in [2..k]$ and $W[k+1..|W|]$ is left-maximal.
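The definitions above can be checked by brute force on small inputs. The following Python sketch is only an
illustration: it enumerates all substrings, so it is far slower than anything used in this paper, and the
function names `extensions` and `maximal_repeats` are ours. It assumes that `#` occurs only at the end of the
text, and it ignores the empty string.

```python
def extensions(text, w):
    """Return the sets of left and right extensions of w in the circular text."""
    n = len(text)
    circular = text + text
    left, right = set(), set()
    for start in range(n):
        if circular[start:start + len(w)] == w:
            left.add(text[start - 1])               # text[-1] wraps around to '#'
            right.add(circular[start + len(w)])
    return left, right


def maximal_repeats(text):
    """List all non-empty substrings that are both left- and right-maximal."""
    n = len(text)
    substrings = {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    result = []
    for w in substrings:
        left, right = extensions(text, w)
        if len(left) > 1 and len(right) > 1:
            result.append(w)
    return sorted(result)
```

For example, on the text `010010001#` the non-empty maximal repeats found by this sketch are `0`, `00`, `001`,
`01` and `0100`.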

For reasons of space we assume the reader to be familiar with the notion of suffix tree
$\mathrm{ST}_T = (V, E)$ of $T$, which we do not define here. We denote by $\ell(\gamma)$, or equivalently by
$\ell(u, v)$, the label of edge $\gamma = (u, v) \in E$, and we denote by $\ell(v)$ the string label of node
$v \in V$. It is well known that a substring $W$ of $T$ is right-maximal (respectively, left-maximal) iff
$W = \ell(v)$ for some internal node $v$ of $\mathrm{ST}_T$ (respectively, iff $\overline{W} = \ell(v)$ for some
internal node $v$ of $\mathrm{ST}_{\overline{T}}$). We assume the reader to be familiar with the notion of suffix
link connecting a node $v$ with $\ell(v) = aW$ for some $a \in [0..\sigma]$ to a node $w$ with $\ell(w) = W$: we
say that $w$ = suffixLink($v$) in this case. Here we just recall that inverting the direction of all suffix links
yields the so-called explicit Weiner links. Given an internal node $v$ and a symbol $a \in [0..\sigma]$, it might
happen that string $a\ell(v)$ does occur in $T$, but that it is not right-maximal, i.e. it is not the label of any
internal node: all such left extensions of internal nodes that end in the middle of an edge are called implicit
Weiner links. An internal node can have more than one outgoing Weiner link, and all such Weiner links have
distinct labels.

The compact directed acyclic word graph of a string $T$ (denoted by $\mathrm{CDAWG}_T$ in what follows) is the
minimal compact automaton representing the set of suffixes of a given string [?]. It can be seen as the
minimization of $\mathrm{ST}_T$, in which all leaves are merged to the same node (the sink) that represents $T$
itself, and in which all nodes except the sink are in one-to-one correspondence with the maximal repeats of $T$
[16]. Since a maximal repeat corresponds to a set of right-maximal substrings, $\mathrm{CDAWG}_T$ can be built by
putting in the same equivalence class all nodes of $\mathrm{ST}_T$ that belong to the same maximal unary path of
explicit Weiner links.

For reasons of space we assume the reader to be familiar with the notion and uses of the Burrows-Wheeler
transform of $T$, including the $C$ array and backward searching. In this paper we use $\mathrm{BWT}_T$ to denote
the BWT of $T$, and we use range($W$) = [sp($W$)..ep($W$)] to denote the lexicographic interval of a string $W$ in
a BWT that is implicit from the context. We say that $\mathrm{BWT}_T[i..j]$ is a run iff
$\mathrm{BWT}_T[k] = c \in [0..\sigma]$ for all $k \in [i..j]$, and moreover if any substring
$\mathrm{BWT}_T[i'..j']$ such that $i' \leq i$, $j' \geq j$, and either $i' \neq i$ or $j' \neq j$, contains at
least two distinct characters. It is well known that repetitions in $T$ tend to be converted into runs of
$\mathrm{BWT}_T$. We denote by $\mathcal{R}_T$ the set of all triplets $(c, i, j)$ such that
$\mathrm{BWT}_T[i..j]$ is a run of character $c$, and we use $r_T$ and $\overline{r}_T$ as shorthands for
$|\mathcal{R}_T|$ and $|\mathcal{R}_{\overline{T}}|$, respectively.

The LZ77 factorization of $T$ is the greedy decomposition $T_1 T_2 \cdots T_z$ of $T$ obtained as follows. Assume
that $T$ is virtually preceded by the $\sigma$ distinct characters in its alphabet, and assume that
$T_1 T_2 \cdots T_i$ has already been computed for some prefix of length $k$ of $T$: then, $T_{i+1}$ is the longest
prefix of $T[k+1..n]$ such that there is a $j \leq k$ that satisfies $T[j..j + |T_{i+1}| - 1] = T_{i+1}$. We
denote by $\mathcal{Z}_T$ the set of pairs $(T_i, p_i)$ for all $i \in [1..z]$, where $p_i$ is the starting
position of $T_i$ in $T$, and we use $z_T$ as a shorthand for $|\mathcal{Z}_T|$. From now on, we drop subscripts
whenever the string $T$ they specify is clear from the context.
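The greedy rule above can be sketched directly in Python. This is a naive illustration only (it compares
the unread suffix against every earlier starting position, so it takes quadratic time or worse), and the name
`lz77_factors` is ours. A factor of length one is always accepted, which models the assumption that the text is
virtually preceded by all the characters of its alphabet; the source of a copy is allowed to overlap the factor
being produced.

```python
def lz77_factors(text):
    """Greedy LZ77 factorization; returns a list of (factor, 1-based start) pairs."""
    n = len(text)
    factors = []
    k = 0  # length of the prefix that has already been factorized
    while k < n:
        # A factor of length one is always available, because the text is
        # virtually preceded by all the characters of its alphabet.
        best = 1
        for j in range(k):  # candidate source; it may overlap the new factor
            length = 0
            while k + length < n and text[j + length] == text[k + length]:
                length += 1
            best = max(best, length)
        factors.append((text[k:k + best], k + 1))
        k += best
    return factors
```

On `0000#` this produces the three factors `0`, `000`, `#`, and on `010010001#` it produces
`0`, `1`, `0`, `0100`, `01`, `#`.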

2.1 Relationships among maximal repeats, runs in BWT, and LZ factors 

Clearly $|\mathcal{R}|$ can be as small as two, e.g. in string $0^{n-1}\#$, and as large as $\Theta(n)$, e.g. in
the string of length $n$ that contains exactly $n$ distinct characters, or in a de Bruijn string of order $k > 1$
on a binary alphabet: this string of length $\sigma^k + k - 1$ contains all the distinct $k$-mers, thus the
interval of every $(k-1)$-mer in $\mathrm{BWT}_T$ contains exactly $\sigma$ distinct characters, and the number of
runs in $\mathrm{BWT}_T$ is thus at least $\sigma^{k-1}(\sigma - 1)$. It is known that $|\mathcal{Z}|$ is
$O(n / \log_{\sigma} n)$ [12], and it can be constant, e.g. in $0^{n-1}\#$. Conversely, $|\mathcal{M}|$ can be
zero, e.g. in a string of length $n$ that contains exactly $n$ distinct characters, and it can be $\Theta(n)$ in
the worst case, e.g. in string $0^{n-1}\#$. When maximal repeats exist, the number of right extensions of maximal
repeats is $\Omega(\log n)$ (see the appendix), and this lower bound is matched by Fibonacci strings and by
Thue-Morse strings of length $n$, whose CDAWG contains $O(\log n)$ nodes [15, 17].
Both $|\mathcal{M}|/|\mathcal{R}|$ and $|\mathcal{M}|/|\mathcal{Z}|$ can be $\Theta(n)$, for example in the already
mentioned $0^{n-1}\#$. $|\mathcal{R}|/|\mathcal{Z}|$ can be $\Theta(\log n)$, e.g. in the already mentioned de
Bruijn string $T$ of order $k$, which has $\Theta(n / \log_{\sigma} n)$ LZ factors.
However, $|\mathcal{M}|$, $|\mathcal{R}|$ and $|\mathcal{Z}|$ can all grow at the same asymptotic rate in the same
family of strings. Consider e.g. string $T = 0^1 1 0^2 1 \cdots 0^x 1 \#$ of length $x(x+3)/2 + 1$. Clearly
$|\mathcal{Z}| = x + 3$, and $|\mathcal{M}| = 3(x - 1)$ since the maximal repeats of $T$ are only the empty string
and the substrings $0^i 1$ for $i \in [1..x-1]$, $0^j$ for $j \in [1..x-1]$, and $0^{k-1} 1 0^k$ for
$k \in [2..x-1]$. Replacing $\#$ with a new block $0^{x+1} 1 \#$ in string $T$ creates two new runs for every
$x > 1$, thus $|\mathcal{R}| = 2x$ for $x > 1$.
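The run counts quoted above are easy to reproduce on small instances. The sketch below is a naive check, not
part of any construction in this paper: it builds the BWT by sorting all suffixes explicitly, it assumes that
`#` terminates the text and is smaller than every other character (true for ASCII digits and letters), and the
helper names `bwt`, `count_runs` and `block_string` are ours.

```python
def bwt(text):
    """BWT of a '#'-terminated text, built by sorting its suffixes explicitly."""
    order = sorted(range(len(text)), key=lambda s: text[s:])
    return "".join(text[s - 1] for s in order)   # text[-1] is the final '#'


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def block_string(x):
    """The string 0^1 1 0^2 1 ... 0^x 1 # used in the example above."""
    return "".join("0" * i + "1" for i in range(1, x + 1)) + "#"
```

For instance, `bwt("0000#")` has two runs, `bwt("abcd#")` has five, and `bwt(block_string(3))` equals
`110100#000`, which has six runs.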

Recall that a substring $W$ of $T$ is a maximal repeat iff $W = \ell(v)$ for some internal node $v$ of
$\mathrm{ST}_T = (V, E)$, and moreover if there are at least two Weiner links from $v$. Since the set of all
left-maximal substrings of $T$ is closed under the prefix operation, there is a bijection between
$\mathcal{M}_T$ and the nodes that lie on the paths of $\mathrm{ST}_T$ that start from the root and that end at
nodes labeled by rightmost maximal repeats, defined as follows:

Definition 1. A maximal repeat $W$ of a string $T \in [1..\sigma]^{n-1}\#$ is rightmost if no string $WV$ with
$V \in [0..\sigma]^{+}$ is left-maximal in $T$.

We denote the set of rightmost maximal repeats of $T$ by $\mathcal{M}^r_T$. We also denote by $\mathcal{E}^r_T$
the set of edges of $\mathrm{ST}_T$ that connect pairs of nodes labeled by maximal repeats, and we denote by
$\mathcal{F}^r_T$ the set of edges $(v, w)$ in $\mathrm{ST}_T$ such that $\ell(v) \in \mathcal{M}_T$ and
$\ell(w) \notin \mathcal{M}_T$. We use $\mathcal{E}^{\ell}_T$ and $\mathcal{F}^{\ell}_T$ to denote symmetrical
concepts in $\mathrm{ST}_{\overline{T}}$, and we use $e_T$ and $\overline{e}_T$ as shorthands for
$|\mathcal{E}^r_T| + |\mathcal{F}^r_T|$ and for $|\mathcal{E}^{\ell}_T| + |\mathcal{F}^{\ell}_T|$, respectively.
Clearly $\mathcal{E}^r$ and $\mathcal{F}^r$ are the image of explicit and implicit Weiner links of
$\mathrm{ST}_{\overline{T}}$:

Lemma 1. Let $\mathrm{ST}_T = (V, E)$. There is a bijection between $\mathcal{E}^r_T$ and the set of all explicit
Weiner links from nodes of $\mathrm{ST}_{\overline{T}}$ that correspond to maximal repeats of $T$. There is a
bijection between $\mathcal{F}^r_T$ and the set of all implicit Weiner links from nodes of
$\mathrm{ST}_{\overline{T}}$ that correspond to maximal repeats of $T$.

The proof of Lemma 1 is provided in the appendix. It is clear that the set of suffix tree edges
$\mathcal{E}^r_T \cup \mathcal{F}^r_T$ is in one-to-one correspondence with the set of all arcs of
$\mathrm{CDAWG}_T$. This set of edges is also related to runs in $\mathrm{BWT}_T$:

Theorem 1. $|[0..\sigma] \setminus \bigcup_{W \in \mathcal{M}^r_T} \Sigma^{\ell}_T(W)| + \sum_{W \in \mathcal{M}^r_T} |\Sigma^{\ell}_T(W)| - |\mathcal{M}^r_T| + 1 \leq |\mathcal{R}_T| \leq |\mathcal{F}^r_T|$.

Proof. The root of $\mathrm{ST}_T$ is a maximal repeat, thus the destinations of all edges in $\mathcal{F}^r$
partition all leaves of $\mathrm{ST}_T$ into disjoint subtrees, or equivalently they partition the entire
$\mathrm{BWT}_T$ in disjoint blocks. Since every such block is the interval in $\mathrm{BWT}_T$ of some string
that is not left-maximal, all characters of $\mathrm{BWT}_T$ in the same block are identical, thus the number of
runs in $\mathrm{BWT}_T$ cannot be bigger than $|\mathcal{F}^r|$.

The interval of a string $W \in \mathcal{M}^r$ in $\mathrm{BWT}_T$ contains exactly $|\Sigma^{\ell}_T(W)|$ distinct
characters, and at most one of them is identical to the character that precedes the largest suffix of $T$ smaller
than $W$ in lexicographic order (note that such suffix might not be prefixed by any string in $\mathcal{M}^r$).
Thus, the number of runs in $\mathrm{BWT}_T$ is at least
$\sum_{W \in \mathcal{M}^r} |\Sigma^{\ell}_T(W)| - |\mathcal{M}^r| + 1$. Factor
$|[0..\sigma] \setminus \bigcup_{W \in \mathcal{M}^r} \Sigma^{\ell}_T(W)|$ in the claim takes into account symbols
of $T$ that never occur to the left of strings in $\mathcal{M}^r$. □

A symmetrical argument holds for $\mathcal{R}_{\overline{T}}$. The set of arcs in $\mathrm{CDAWG}_T$ is also
related to the LZ factorization of $T$:

Theorem 2. $|\mathcal{Z}_T| \leq |\mathcal{E}^r_T \cup \mathcal{F}^r_T|$.

Proof. Let $T = T_1 T_2 \cdots T_z$ be the LZ factorization of $T$, and let $p_1, p_2, \ldots, p_z$ be the sequence
such that $p_i$ is the starting position of factor $T_i$ in $T$. Every factor is a right-maximal substring of $T$,
but it is not necessarily left-maximal: let $W_i$ be a suffix of $T[1..p_i - 1]$ such that $W_i T_i$ is both
right-maximal and left-maximal, and assume that we assign $T_i$ to the edge $(v, w)$ in
$\mathcal{E}^r_T \cup \mathcal{F}^r_T$ such that $\ell(v) = W_i T_i$, $v$ = parent($w$), and the first character of
$T_{i+1}$ equals the first character of $\ell(v, w)$. Assume that there is some $j > i$ for which we assign $T_j$
to the same maximal repeat $W_i T_i$. Then, the first character of $T_{j+1}$ must be different from the first
character of $T_{i+1}$, otherwise factor $T_j$ would have been longer. It follows that every LZ factor can be
assigned to a distinct element of $\mathcal{E}^r_T \cup \mathcal{F}^r_T$. □

The gap between $r$ and $e$, and between $z$ and $e$, is apparent from Figure 1 (center). However, all these
measures seem to grow at the same relative rate in practice (right panel).

2.2 Repetition-aware data structures 

Given a string $T \in [1..\sigma]^{n-1}\#$, we call run-length encoded BWT any representation of
$\mathrm{BWT}_T$ that takes $O(|\mathcal{R}_T|)$ words of space, and that supports rank and select operations: see
for example [12, ?, ?]. Let $\mathcal{R}_T$ be a set of triplets $(c, i, j)$ such that $\mathrm{BWT}_T[i..j]$ is a
run of character $c$. It is easy to implement rank in $O(\log\log n)$ time, by encoding $\mathcal{R}_T$ as
$\sigma + 1$ predecessor data structures [?], each of which stores the second component of all triplets with the
same first component. For every such second component $i$, we also store in an array the sum of all occurrences
of $c$ up to $i$, exclusive. To implement select in $O(\log\log n)$ time, we can similarly encode $\mathcal{R}_T$
as $\sigma + 1$ predecessor data structures, each of which stores value
$\mathrm{rank}_c(\mathrm{BWT}_T, i - 1)$ for all triplets $(c, i, j)$ with the same value of $c$. We also store
the value of $i$ for every such triplet. We denote the run-length encoded BWT of $T$ by $\mathrm{RLBWT}_T$.
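The rank and select scheme above can be sketched as follows. This is a simplified illustration under our own
assumptions: the class name `RunLengthBWT` is hypothetical, and the predecessor data structures are replaced by
binary search over sorted Python lists (via the standard `bisect` module), so queries take time logarithmic in
the number of runs rather than $O(\log\log n)$. Positions are 1-based, as in the text.

```python
import bisect


class RunLengthBWT:
    """One (start, end, occurrences-before-start) record per run, grouped by character."""

    def __init__(self, bwt):
        self.starts, self.ends, self.before = {}, {}, {}
        i, n = 0, len(bwt)
        while i < n:
            j = i
            while j + 1 < n and bwt[j + 1] == bwt[i]:
                j += 1
            c = bwt[i]
            seen = self.before.setdefault(c, [])
            starts = self.starts.setdefault(c, [])
            ends = self.ends.setdefault(c, [])
            # Occurrences of c strictly before this run.
            seen.append(seen[-1] + ends[-1] - starts[-1] + 1 if seen else 0)
            starts.append(i + 1)   # 1-based, inclusive
            ends.append(j + 1)
            i = j + 1

    def rank(self, c, pos):
        """Number of occurrences of c in bwt[1..pos]."""
        starts = self.starts.get(c, [])
        k = bisect.bisect_right(starts, pos) - 1   # predecessor of pos among run starts
        if k < 0:
            return 0
        return self.before[c][k] + min(pos, self.ends[c][k]) - starts[k] + 1

    def select(self, c, t):
        """Position of the t-th occurrence of c in bwt (t is 1-based)."""
        before = self.before.get(c, [])
        k = bisect.bisect_left(before, t) - 1      # last run with fewer than t earlier occurrences
        if k < 0 or t > before[k] + self.ends[c][k] - self.starts[c][k] + 1:
            raise ValueError("no such occurrence")
        return self.starts[c][k] + t - before[k] - 1
```

The space is one record per run, which matches the $O(|\mathcal{R}_T|)$ bound in spirit; only the query time
differs from the structure described in the text.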

For reasons of space we assume the reader to be familiar with LZ77-indexes: see e.g. [?]. Here
we just recall that a primary occurrence of a pattern $P$ in a string $T \in [1..\sigma]^{n-1}\#$ is one that
crosses or ends at a phrase boundary in the LZ77 factorization $T_1 T_2 \cdots T_z$ of $T$. All other occurrences
are called secondary. Once we have determined all primary occurrences, locating secondary occurrences reduces to
two-sided range reporting and takes $O(\mathrm{occ} \log\log n)$ time with a data structure that takes $O(z)$ words
of space [?]. To locate primary occurrences, we can use a data structure for four-sided range reporting on a
$z \times z$ grid, with a marker at $(x, y)$ if the $x$th LZ factor in lexicographic order is preceded in the text
by the lexicographically $y$th reversed prefix ending at a phrase boundary. This data structure takes $O(z)$ words
of space, and it returns all the phrase boundaries immediately followed by a factor in the specified range, and
immediately preceded by a reversed prefix in the specified range, in $O((1 + k) \log^{\epsilon} z)$ time, where
$k$ is the number of phrase boundaries reported [3].
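The distinction between primary and secondary occurrences can be made concrete with a brute-force classifier.
The sketch below is illustrative only; the name `classify_occurrences` is ours, and we read "crosses or ends at a
phrase boundary" as "covers the last position of some phrase", which is an assumption about the convention rather
than a statement from the cited indexes. Positions are 1-based.

```python
def classify_occurrences(text, pattern, phrase_starts):
    """Split the occurrences of pattern into (primary, secondary) starting positions.

    phrase_starts lists the 1-based starting positions p_1 < ... < p_z of the factors.
    """
    n, m = len(text), len(pattern)
    # Last position of every phrase: one before the next phrase start, plus n.
    phrase_ends = {p - 1 for p in phrase_starts[1:]} | {n}
    primary, secondary = [], []
    for s in range(1, n - m + 2):
        if text[s - 1:s - 1 + m] != pattern:
            continue
        if any(e in phrase_ends for e in range(s, s + m)):
            primary.append(s)
        else:
            secondary.append(s)
    return primary, secondary
```

With the factorization `0 | 1 | 0 | 0100 | 01 | #` of `010010001#`, the occurrences of `01` at positions 1 and 8
are primary, while the one at position 4 lies strictly inside the factor `0100` and is secondary.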


3 Combining runs in BWT and LZ factors 

In this section we describe how to combine data structures whose size depends on the number of LZ factors
of a string $T \in [1..\sigma]^{n-1}\#$, and data structures whose size depends on the number of runs in
$\mathrm{BWT}_T$, to report all the occurrences of a pattern in $T$. To do so, we first need to solve the following
subproblem. Let $\mathrm{ST}_T = (V, E)$ be the suffix tree of $T$, and let
$V' = \{v_1, v_2, \ldots, v_k\} \subseteq V$ be a subset of the nodes of $\mathrm{ST}_T$.
Consider the list of node labels $L = \ell(v_1), \ell(v_2), \ldots, \ell(v_k)$, sorted in lexicographic order.
Given a string $W \in [0..\sigma]^{*}$, we want to implement function $I(W, V')$ that returns the (possibly empty)
interval of $W$ in $L$. The following lemma describes how to do this in $O(k)$ words of space:

Lemma 2. Let $T \in [1..\sigma]^{n-1}\#$ be a string, and let $V'$ be a subset of $k$ nodes of its suffix tree,
represented as intervals in $\mathrm{BWT}_T$. Given the interval $[i..j]$ of a string $W \in [0..\sigma]^{*}$ in
$\mathrm{BWT}_T$, there is a data structure that takes $O(k)$ words of space and that computes $I(W, V')$ in
$O(\log k)$ time.

Proof. We store a bitvector first[1..n] such that first[i] = 1 iff there is a node $v' \in V'$ such that
range($v'$) = $[i..j]$ for some $j$. Similarly, we store a bitvector last[1..n] such that last[j] = 1 iff there is
a node $v' \in V'$ such that range($v'$) = $[i..j]$ for some $i$. Let $\alpha$ and $\beta$ be the number of ones in
first and last, respectively. We build prefix-sum arrays First and Last on such bitvectors using $O(k)$ words of
space (so that First[$p$] is the position of the $p$th one in first, and similarly for Last), and we discard first
and last. Let $F[1..\alpha]$ be the array such that $F[i]$ equals the number of intervals $[p..q]$ such that $p$ is
the $i$th one in first and $[p..q]$ = range($v'$) for a node $v' \in V'$. Similarly, let $L[1..\beta]$ be the array
such that $L[i]$ equals the number of intervals $[p..q]$ such that $q$ is the $i$th one in last and $[p..q]$ =
range($v'$) for a node $v' \in V'$. We represent $F$ and $L$ as prefix-sum arrays using $O(k)$ words of space, and
we discard $F$ and $L$.

Let $I(W, V') = [x..y]$. Given the interval $[i..j]$ of a string $W$ in $\mathrm{BWT}_T$, we find the corresponding
interval $[i'..j']$ in array first in $O(\log \alpha)$ time, using binary search on First. Specifically,
$i' = \min\{p \in [1..\alpha] : \mathrm{First}[p] \geq i\}$ and
$j' = \max\{q \in [1..\alpha] : \mathrm{First}[q] \leq j\}$. If $j' < i'$ then $W$ is not the prefix of a label of
a node in $V'$. Otherwise, since all nodes $v' \in V'$ whose BWT interval starts inside $[i+1..j]$ are right
extensions of $W$, we set $y = \sum_{p=1}^{j'} F[p]$ in constant time using the prefix-sum representation of $F$.
If $\mathrm{First}[i'] \neq i$, i.e. if no interval of a node $v' \in V'$ starts at position $i$ in
$\mathrm{BWT}_T$, then we can just set $x = 1 + \sum_{p=1}^{i'-1} F[p]$ and stop.

Otherwise, it could happen that just a (possibly empty) subset of all the nodes in $V'$ whose interval starts
at position $i$ in $\mathrm{BWT}_T$ correspond to $W$ or to right extensions of $W$: the intervals of such nodes
necessarily end inside $[i..j]$. All the other intervals that start at position $i$ could correspond instead to
prefixes of $W$, and they necessarily end after position $j$ in $\mathrm{BWT}_T$. Thus, let $[i''..j'']$ be the
interval in last that corresponds to $[i..j]$: specifically, let
$i'' = \min\{p \in [1..\beta] : \mathrm{Last}[p] \geq i\}$ and
$j'' = \max\{q \in [1..\beta] : \mathrm{Last}[q] \leq j\}$. To determine the number of intervals that start at
position $i$ in $\mathrm{BWT}_T$ and that correspond to prefixes of $W$, it suffices to compute the difference
$\delta$ between the number of starting positions and the number of ending positions inside interval $[i..j]$, as
follows:

$\delta = \left( \sum_{p=1}^{j'} F[p] - \sum_{p=1}^{i'-1} F[p] \right) - \left( \sum_{q=1}^{j''} L[q] - \sum_{q=1}^{i''-1} L[q] \right)$.

Then, $x = 1 + \sum_{p=1}^{i'-1} F[p] + \delta$. All such sums can be computed in constant time using the
prefix-sum representations of $F$ and $L$. □
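To make the semantics of $I(W, V')$ concrete, the following sketch computes it by brute force from BWT
intervals. It is a reference for what the function returns, not an implementation of the data structure in
Lemma 2: it scans all $k$ intervals instead of using binary search and prefix sums, and the names `sa_interval`
and `label_interval` are ours. It relies on the fact that suffix tree intervals are either nested or disjoint: an
interval that contains the interval of $W$ belongs to a prefix of $W$, and an interval contained in it belongs to
$W$ or to one of its right extensions.

```python
def sa_interval(text, w):
    """1-based inclusive interval of w among the sorted suffixes of text, or None."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    hits = [r + 1 for r, s in enumerate(suffixes) if s.startswith(w)]
    return (hits[0], hits[-1]) if hits else None


def label_interval(w_range, node_ranges):
    """I(W, V'): 1-based ranks, in the sorted list of labels, of the labels prefixed by W.

    w_range is the BWT interval of W; node_ranges are the BWT intervals of the nodes in V'.
    Returns None if W is not a prefix of any label.
    """
    i, j = w_range
    # Labels that sort before W: intervals starting before i, or proper prefixes of W.
    before = sum(1 for p, q in node_ranges if p < i or (p == i and q > j))
    # Labels equal to W or extending W to the right: intervals nested inside [i..j].
    inside = sum(1 for p, q in node_ranges if i <= p and q <= j)
    return (before + 1, before + inside) if inside else None
```

For example, with the text `010010001#` and the labels `0`, `00`, `001`, `01`, `0100`, `1`, `100`, the string
`01` has BWT interval [5..7] and `label_interval` returns (4, 5), i.e. the ranks of `01` and `0100`.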

Consider now a factorization of $T$ such that all factors are right-maximal substrings of $T$, and let $V'$ be
the set of nodes of $\mathrm{ST}_T$ that correspond to the distinct factors. To locate all the occurrences of a
pattern that cross or end at a boundary between two factors, we just need an implementation of function
$I(W, V')$ and a pair of RLBWTs:

Lemma 3. Let $T \in [1..\sigma]^{n-1}\#$ be a string, and let $T = T_1 T_2 \cdots T_z$ be a factorization of $T$ in
which all factors are right-maximal substrings. There is a data structure that takes
$O(z + r_T + \overline{r}_T)$ words of space and that reports all the occ occurrences of a pattern
$P \in [0..\sigma]^{m}$ that cross or end at a boundary between two factors of $T$, in
$O(m(\log\log n + \log z) + \mathrm{occ} \log^{\epsilon} z)$ time.


Proof. Let $p_1, p_2, \ldots, p_z$ be the sequence such that $p_i$ is the starting position of factor $T_i$ in
$T$. The same occurrence of $P$ in $T$ can cover up to $m$ boundaries between two factors, thus we organize the
computation as follows. We consider every possible way to place the rightmost boundary between two factors in
$P$, i.e. every possible split of $P$ into two parts $P[1..k-1]$ and $P[k..m]$ for $k \in [2..m]$, such that
$P[k..m]$ is either a factor or a proper prefix of a factor. For every such $k$, we use four-sided range reporting
queries to list all the occurrences of $P$ in $T$ that conform to this split, as described in Section 2.2. The
four-sided range reporting data structure represents the mapping between the lexicographic rank of a factor $W$
among all the distinct factors of $T$, and the lexicographic ranks of all the reversed prefixes $T[1..p_i - 1]$
such that $T_i = W$, among all the reversed prefixes of $T$ that end at the last position of a factor. As described
in Section 2.2, this data structure takes $O(z)$ words of space.

We encode sequence $p_1, p_2, \ldots, p_z$ implicitly, as follows: we use a bitvector last[1..n] such that
last[i] = 1 iff $\mathrm{SA}_{\overline{T}}[i] = n - p_j + 2$ for some $j \in [1..z]$, i.e. iff
$\mathrm{SA}_{\overline{T}}[i]$ is the last position of a factor. We represent such bitvector as a predecessor data
structure with partial ranks, using $O(z)$ words of space [?]. Then, we build the data structure described in
Lemma 2, where $V'$ is the set of loci in $\mathrm{ST}_T$ of all factors of $T$. This data structure takes $O(z)$
words of space, and together with last, $\mathrm{RLBWT}_T$ and $\mathrm{RLBWT}_{\overline{T}}$, it is the output of
our construction.

Given a pattern $P \in [0..\sigma]^{m}$, we first perform a backward search in $\mathrm{RLBWT}_T$ to determine the
number of occurrences of $P$ in $T$: if this number is zero, we stop. During this backward search, we store in a
table the interval $[i_k..j_k]$ of $P[k..m]$ in $\mathrm{BWT}_T$ for every $k \in [2..m]$. Then, we compute the
interval $[i'_{k-1}..j'_{k-1}]$ of $\overline{P[1..k-1]}$ in $\mathrm{BWT}_{\overline{T}}$ for every
$k \in [2..m]$, using backward search in $\mathrm{RLBWT}_{\overline{T}}$: if
$\mathrm{rank}_1(\mathrm{last}, j'_{k-1}) - \mathrm{rank}_1(\mathrm{last}, i'_{k-1} - 1) = 0$, then $P[1..k-1]$
never ends at the last position of a factor, and we can discard this value of $k$. Otherwise, we convert
$[i'_{k-1}..j'_{k-1}]$ to the interval
$[\mathrm{rank}_1(\mathrm{last}, i'_{k-1} - 1) + 1..\mathrm{rank}_1(\mathrm{last}, j'_{k-1})]$
of all the reversed prefixes of $T$ that end at the last position of a factor. Rank operations on last can be
implemented in $O(\log\log n)$ time using predecessor queries. We get the lexicographic interval of $P[k..m]$ in
the list of all the distinct factors of $T$ using operation $I(P[k..m], V')$, in $O(\log z)$ time. We use such
intervals to query the four-sided range reporting data structure. □
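The table of intervals collected during the backward search can be illustrated with a plain (uncompressed)
backward search. The sketch below is a simplification under our own assumptions: it sorts the suffixes
explicitly and counts characters by scanning the BWT, whereas the construction above uses $\mathrm{RLBWT}_T$ with
predecessor queries; the name `backward_search_table` is ours, and `#` is assumed to terminate the text and to be
its smallest character.

```python
def backward_search_table(text, pattern):
    """Map k (1-based) to the 1-based interval of pattern[k..m] among the sorted suffixes.

    The search stops at the first suffix of the pattern that does not occur in the text.
    """
    n = len(text)
    order = sorted(range(n), key=lambda s: text[s:])
    bwt = "".join(text[s - 1] for s in order)        # text[-1] is the final '#'
    smaller = {c: sum(1 for d in text if d < c) for c in set(text)}  # the C array
    sp, ep = 1, n
    table = {}
    for k in range(len(pattern), 0, -1):
        c = pattern[k - 1]
        if c not in smaller:
            break
        # One backward step: C[c] + rank_c(BWT, sp - 1) + 1 and C[c] + rank_c(BWT, ep).
        sp = smaller[c] + bwt[:sp - 1].count(c) + 1
        ep = smaller[c] + bwt[:ep].count(c)
        if sp > ep:
            break
        table[k] = (sp, ep)
    return table
```

On the text `010010001#` and the pattern `0100`, the table maps 4 to [2..7], 3 to [2..4], 2 to [9..10] and 1 to
[6..7], which are the intervals of `0`, `00`, `100` and `0100`.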


The algorithm described in Lemma 3 can be engineered in a number of ways in practice. Here we just
apply it to the LZ factorization of $T$ to find all the primary occurrences of $P$ in $T$, and we use the strategy
described in Section 2.2 to compute secondary occurrences, obtaining the key result of this section:

Theorem 3. Let $T \in [1..\sigma]^{n-1}\#$ be a string, and let $T = T_1 T_2 \cdots T_z$ be its LZ factorization.
There is a data structure that takes $O(z + r_T + \overline{r}_T)$ words of space and that reports all the pocc
primary occurrences and all the socc secondary occurrences of a pattern $P \in [0..\sigma]^{m}$ in
$O(m(\log\log n + \log z) + \mathrm{pocc} \log^{\epsilon} z + \mathrm{socc} \log\log n)$ time.


4 Combining runs in BWT and maximal repeats

An alternative way to compute all the occurrences of a pattern in a string $T$ consists in combining
$\mathrm{RLBWT}_T$ with $\mathrm{CDAWG}_T$, using an amount of space proportional to the number of right extensions
of the maximal repeats of $T$:

Theorem 4. Let $T \in [1..\sigma]^{n-1}\#$ be a string. There is a data structure that takes $O(e_T)$ words of
space (or alternatively, $O(\overline{e}_T)$ words of space) and that reports all the occ occurrences of a pattern
$P \in [0..\sigma]^{m}$ in $O(m \log\log n + \mathrm{occ})$ time.

Proof. We build $\mathrm{RLBWT}_T$ and $\mathrm{CDAWG}_T$. For every node $v$ in the CDAWG, we store $|\ell(v)|$ in
a variable $v$.length. Recall that an arc $(v, w)$ of the CDAWG means that maximal repeat $\ell(w)$ can be obtained
by extending maximal repeat $\ell(v)$ to the right and to the left. Thus, for every arc $\gamma = (v, w)$ of
$\mathrm{CDAWG}_T$, we store the first character of $\ell(\gamma)$ in a variable $\gamma$.char, and we store the
length of the right extension implied by $\gamma$ in a variable $\gamma$.right. The length $\gamma$.left of the left
extension implied by $\gamma$ can be computed by $w$.length − $v$.length − $\gamma$.right. Clearly arcs of
$\mathrm{CDAWG}_T$ that correspond to edges of $\mathrm{ST}_T$ in set $\mathcal{E}^r_T$ induce no left extension.
For every arc of $\mathrm{CDAWG}_T$ that connects a maximal repeat $W$ to the sink, we store just $\gamma$.char and
the starting position $\gamma$.pos of string $W \cdot \gamma$.char in $T$. The total space used by the CDAWG is
clearly $O(e_T)$ words, and by Theorem 1 the space used by $\mathrm{RLBWT}_T$ is $O(|\mathcal{F}^r_T|)$ words. An
alternative construction could use $\mathrm{CDAWG}_{\overline{T}}$ and $\mathrm{RLBWT}_{\overline{T}}$.

We use the RLBWT to count the number of occurrences of $P$ in $T$ in $O(m \log\log n)$ time: if this number
is greater than zero, we use the CDAWG to report all the occ occurrences of $P$ in $T$ in $O(\mathrm{occ})$ time,
using the technique sketched in [5]. Specifically, since we know that $P$ occurs in $T$, we perform a blind search
for $P$ in the CDAWG, as is typically done with Patricia trees. We keep a variable $i$, initialized to zero, that
stores the length of the prefix of $P$ that we have matched so far, and we keep a variable $j$, initialized to one,
that stores the starting position of $P$ inside the last maximal repeat encountered during the search. For every
node $v$ in the CDAWG, we choose the arc $\gamma$ such that $\gamma$.char = $P[i+1]$ in constant time using hashing,
we increment $i$ by $\gamma$.right, and we increment $j$ by $\gamma$.left. If the search leads to the sink by an arc
$\gamma$, we report $\gamma$.pos + $j$ and we stop. If the search leads to a node $v$ that is associated with the
maximal repeat $W$, we determine all the occurrences of $W$ in $T$ by performing a depth-first traversal of all the
nodes in the CDAWG that are reachable from $v$, updating variables $i$ and $j$ as described above, and reporting
$\gamma$.pos + $j$ for every arc $\gamma$ that leads to the sink. The total number of nodes and arcs reachable from
$v$ is clearly $O(\mathrm{occ})$. □

The combination of $\mathrm{CDAWG}_T$ and $\mathrm{RLBWT}_T$ can also be used to implement a repetition-aware
representation of $\mathrm{ST}_T$. We will apply the following property to support operations on $\mathrm{ST}_T$:

Property 1. A maximal repeat $W \in [1..\sigma]^{m}$ of $T$ is the equivalence class of all the right-maximal
strings $\{W[1..m], \ldots, W[k..m]\}$ such that $W[k+1..m]$ is left-maximal, and $W[i..m]$ is not left-maximal for
all $i \in [2..k]$. Equivalently, the node $v'$ of $\mathrm{CDAWG}_T$ with $\ell(v') = W$ is the equivalence class of
the nodes $\{v_1, \ldots, v_k\}$ of $\mathrm{ST}_T$ such that $\ell(v_i) = W[i..m]$ for all $i \in [1..k]$, and such
that $v_k, v_{k-1}, \ldots, v_1$ is a maximal unary path of Weiner links.

Thus, the set of right-maximal strings that belong to the equivalence class of a maximal repeat can be
represented by a single integer $k$, and a right-maximal string can be identified by the maximal repeat $W$ it
belongs to, and by the length of the corresponding suffix of $W$. In $\mathrm{BWT}_T$, the right-maximal strings in
the same equivalence class enjoy the following additional properties:

Property 2. Let $\{W[1..m], \ldots, W[k..m]\}$ be the right-maximal strings that belong to the equivalence class
of maximal repeat $W \in [1..\sigma]^{m}$, and let range($W[i..m]$) = $[p_i..q_i]$ for $i \in [1..k]$. Then:

1. $|q_i - p_i + 1| = |q_j - p_j + 1|$ for all $i$ and $j$ in $[1..k]$.

2. $\mathrm{BWT}_T[p_i..q_i] = W[i-1]^{q_i - p_i + 1}$ for $i \in [2..k]$. Conversely,
$\mathrm{BWT}_T[p_1..q_1]$ contains at least two distinct characters.

3. $p_{i-1} = C[c] + \mathrm{rank}_c(\mathrm{BWT}_T, p_i)$ and $q_{i-1} = p_{i-1} + q_i - p_i$ for
$i \in [2..k]$, where $c = W[i-1] = \mathrm{BWT}_T[p_i]$.

4. $p_{i+1} = \mathrm{select}_c(\mathrm{BWT}_T, p_i - C[c])$ and $q_{i+1} = p_{i+1} + q_i - p_i$ for
$i \in [1..k-1]$, where $c = W[i]$ is the character that satisfies $C[c] < p_i \leq C[c+1]$. This can be computed
in $O(\log\log n)$ time using a predecessor data structure that uses $O(\sigma)$ words of space.

5. Let $c \in [0..\sigma]$, and let range($W[i..m]c$) = $[x_i..y_i]$ for $i \in [1..k]$. Then,
$x_i = p_i + x_1 - p_1$ and $y_i = p_i + y_1 - p_1$.

Table 1: Time complexities of two representations of $\mathrm{ST}_T$: with intervals in $\mathrm{BWT}_T$ (row 1)
and without intervals in $\mathrm{BWT}_T$ (row 2). Columns: stringDepth, locateLeaf, isAncestor, parent,
nextSibling, child, firstChild, suffixLink, weinerLink, edgeChar, nLeaves.
[Column alignment of the table body lost. Surviving cells in reading order: row 1: $O(1)$, $O(1)$,
$O(\log\log n)$, $O(1)$, $O(\log\log n)$, $O(\log\log n)$, $O(\log\log n)$, $O(1)$; row 2: $O(1)$,
$O(\log\log n)$, $O(1)$, $O(1)$.]
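Points 3 and 4 of Property 2 are the usual LF and inverse-LF steps applied to a whole interval at once. The
sketch below illustrates them on an uncompressed BWT; it is not the $O(\log\log n)$ implementation, the names
`bwt_of`, `left_extend` and `right_contract` are ours, and `#` is assumed to terminate the text and to be its
smallest character. `right_contract` does not check that the input string belongs to the same equivalence class
as its suffix, so it is only meaningful under the hypotheses of point 4.

```python
def bwt_of(text):
    """BWT of a '#'-terminated text, built by sorting its suffixes explicitly."""
    order = sorted(range(len(text)), key=lambda s: text[s:])
    return "".join(text[s - 1] for s in order)


def left_extend(text, interval):
    """Point 3: from the interval of a non-left-maximal string V, return (c, interval of cV)."""
    bwt = bwt_of(text)
    p, q = interval
    c = bwt[p - 1]
    if set(bwt[p - 1:q]) != {c}:
        raise ValueError("left-maximal string: its BWT interval is not a single run")
    smaller = sum(1 for d in text if d < c)           # C[c]
    new_p = smaller + bwt[:p].count(c)                # C[c] + rank_c(BWT, p)
    return c, (new_p, new_p + q - p)


def right_contract(text, interval):
    """Point 4: from the interval of W[i..m], return the interval of W[i+1..m]."""
    bwt = bwt_of(text)
    p, q = interval
    c = sorted(text)[p - 1]                           # the c with C[c] < p <= C[c + 1]
    smaller = sum(1 for d in text if d < c)
    positions = [i + 1 for i, d in enumerate(bwt) if d == c]
    new_p = positions[p - smaller - 1]                # select_c(BWT, p - C[c])
    return new_p, new_p + q - p
```

On the text `010010001#`, the string `100` has interval [9..10] and is always preceded by `0`, so `left_extend`
returns the interval [6..7] of `0100`; `right_contract` maps [6..7] back to [9..10].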

The final property we will exploit relates the equivalence class of a maximal repeat to the equivalence 
classes of its in-neighbors in the CDAWG: 

Property 3. Let $w$ be a node in $\mathrm{CDAWG}_T$ with $\ell(w) = W \in [1..\sigma]^{m}$, and let
$S_w = \{W[1..m], \ldots, W[k..m]\}$ be the right-maximal strings that belong to the equivalence class of node $w$.
Let $\{v^1, \ldots, v^t\}$ be the in-neighbors of $w$ in $\mathrm{CDAWG}_T$, and let $\{V^1, \ldots, V^t\}$ be
their labels. Then, $S_w$ is partitioned into $t$ disjoint sets $S^1_w, \ldots, S^t_w$, such that
$S^i_w = \{W[x^i + 1..m], W[x^i + 2..m], \ldots, W[x^i + |S^i_w|..m]\}$, and the right-maximal string
$V^i[p..|V^i|]$ labels the parent of the locus of the right-maximal string $W[x^i + p..m]$ in $\mathrm{ST}_T$.

Proof. It is clear that the parent in $\mathrm{ST}_T$ of every right-maximal string in the equivalence class of
node $w$ belongs to the equivalence class of an in-neighbor of $w$: we focus here just on showing that the
in-neighbors of $w$ induce a partition on the equivalence class of $w$. Assume that the character that labels arc
$\gamma = (v^i, w)$ in the CDAWG is $c$. Since arc $\gamma$ exists, we can factorize $W$ as $X^i V^i Y^i$, where
$Y^i[1] = c$, and we know that no prefix of $V^i Y^i$ longer than $V^i$ is right-maximal, and that no suffix of $W$
longer than $|V^i Y^i|$ is left-maximal. Consider any suffix $V^i[p..|V^i|]$ of $V^i$ that belongs to the
equivalence class of $V^i$: if $p > 1$, then $W[|X^i| + p..m]$ is not left-maximal, thus $W[|X^i| + p..m]$ belongs
to the equivalence class of $W$. Its prefix $V^i[p..|V^i|]$ is right-maximal, and no longer prefix is
right-maximal. Indeed, assume that string $V^i[p..|V^i|]Z^i$ is right-maximal for some prefix $Z^i$ of $Y^i$. Since
$V^i[p..|V^i|]$ is not left-maximal, then string $V^i[p..|V^i|]Z^i$ is not left-maximal either, and this implies
that $V^i Z^i$ is right-maximal, contradicting the hypothesis. Thus, string $V^i[p..|V^i|]$ labels the parent of the
locus of string $W[|X^i| + p..m]$ in $\mathrm{ST}_T$. If $p = 1$ and $V^i Y^i$ is not left-maximal, the same
argument applies. If $V^i Y^i$ is left-maximal, then $W = V^i Y^i$, and since no right-maximal prefix of $W$ longer
than $V^i$ exists, we have that $V^i$ labels the parent of the locus of $W$ in $\mathrm{ST}_T$. □

Combining the properties above, we obtain the following result:

Theorem 5. Let $T \in [1..\sigma]^{n-1}\#$ be a string. There are two implementations of $\mathrm{ST}_T$ that take
$O(e_T + \overline{e}_T)$ words of space each, and that support the operations in Table 1 with the specified time
complexities.

Proof. We build $\mathrm{RLBWT}_T$ and $\mathrm{CDAWG}_T$, and we annotate the latter as described in Theorem 4,
with the only difference that arcs that connect a maximal repeat to the sink are annotated with character and
length like all other arcs. We store in every node $v$ of the CDAWG the number $v$.size of right-maximal strings
that belong to its equivalence class, the interval [$v$.first..$v$.last] of $\ell(v)$ in $\mathrm{BWT}_T$, a
linear-space predecessor data structure [?] on the boundaries induced on the equivalence class of $v$ by its
in-neighbors (see Property 3), and pointers to the in-neighbor that corresponds to the interval associated with
each boundary. Finally, we add to the CDAWG all suffix links $(v, w)$ from $\mathrm{ST}_T$ such that both $v$ and
$w$ are maximal repeats, and the corresponding explicit Weiner links.

We represent a node $v$ of $\mathrm{ST}_T$ as a tuple $\mathrm{id}(v) = (v', |\ell(v)|, i, j)$, where $v'$ is the node in $\mathrm{CDAWG}_T$ that
corresponds to the equivalence class of $v$, and $[i..j]$ is the interval of $\ell(v)$ in $\mathrm{BWT}_T$. Thus, operation stringDepth
can be implemented in constant time, and if $v$ is a leaf, the second component of $\mathrm{id}(v)$ is its starting position
in $T$. Operation isAncestor can be implemented by testing the containment of the corresponding intervals
in $\mathrm{BWT}_T$. To implement operation suffixLink, we first check whether $|\ell(v)| = v'.\mathtt{length} - v'.\mathtt{size} + 1$: if
so, we take the suffix link $(v',w')$ from $v'$ and we return $(w', w'.\mathtt{length}, w'.\mathtt{first}, w'.\mathtt{last})$. Otherwise, we
return $(v', |\ell(v)| - 1, i', j')$, where $[i'..j']$ is computed as described in point [?] of Property [?]. To implement
weinerLink for some character $c$, we first check whether $|\ell(v)| = v'.\mathtt{length}$: if so, we take the Weiner link
from $v'$ labeled by character $c$ (if any), and we return $(w', w'.\mathtt{length} - w'.\mathtt{size} + 1, i', j')$, where
$[i'..j']$ is computed by taking a backward step with character $c$ from $[v'.\mathtt{first}..v'.\mathtt{last}]$. Otherwise, we check
whether $\mathrm{BWT}_T[i] = c$: if so, we return $(v', |\ell(v)| + 1, i', j')$, where $[i'..j']$ is computed as described in point
[?] of Property [?].
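
The quantity $v'.\mathtt{size}$ and the test used by suffixLink can be checked by brute force on tiny inputs. The following Python sketch is only an illustration under our own assumptions: the function names `class_sizes` and `leaves_class` are hypothetical, the text is assumed to end with a unique terminator `#` and is treated as circular when looking at left extensions, and nothing here uses the RLBWT or the CDAWG.

```python
def class_sizes(text):
    """Map each maximal repeat of `text` to the number of right-maximal
    strings in its equivalence class (brute force, tiny inputs only)."""
    assert text.endswith("#") and text.count("#") == 1
    n = len(text)

    def extensions(w):
        # Characters seen immediately to the left and right of each occurrence
        # of w. text[i - 1] wraps around for i == 0 (circular text assumption).
        left, right = set(), set()
        for i in range(n - len(w)):
            if text.startswith(w, i):
                left.add(text[i - 1])
                right.add(text[i + len(w)])
        return left, right

    # All substrings that do not contain the terminator, including the empty one.
    candidates = {text[i:j] for i in range(n) for j in range(i, n)}
    sizes = {}
    for w in candidates:
        left, right = extensions(w)
        if len(right) < 2:
            continue  # w is not right-maximal
        # Extend to the left until the string becomes left-maximal: the result
        # is the maximal repeat that represents the equivalence class of w.
        while len(left) == 1:
            w = next(iter(left)) + w
            left, right = extensions(w)
        sizes[w] = sizes.get(w, 0) + 1
    return sizes


def leaves_class(depth, length, size):
    """True iff a string of length `depth` is the shortest one in a class whose
    longest string has length `length` and which contains `size` strings."""
    return depth == length - size + 1
```

For `"banana#"` the classes are represented by the empty string, `a` and `ana`, and the class of `ana` also contains `na`; `leaves_class(2, 3, 2)` holds for `na`, which is the case in which suffixLink moves to another CDAWG node.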

To implement child for some character $c$, we follow the arc $\gamma = (v',w')$ in the CDAWG labeled by $c$
(see Observation [?]), and we return tuple $(w', |\ell(v)| + \gamma.\mathtt{right}, i', j')$, where $[i'..j']$ is computed as described
in point [?] of Property [?]. To implement parent we exploit Property [?], i.e. we determine the partition of the
equivalence class of $v'$ that contains $v$ by searching the predecessor of value $|\ell(v)|$ in the set of boundaries
of $v'$: this can be done in $O(\log\log n)$ time [?]. Let $\gamma = (u',v')$ be the arc that connects to $v'$ the
in-neighbor $u'$ associated with the partition that contains $v$. We return tuple $(u', |\ell(v)| - \gamma.\mathtt{right}, i', j')$, where
$i' = i - v'.\mathtt{first} + u'.\mathtt{first}$ and $j' = j + u'.\mathtt{last} - v'.\mathtt{last}$, as described in point [?] of Property [?]. Operation
nextSibling can be implemented in the same way.
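
The predecessor search behind parent can be sketched as follows. This is a simplified illustration, not the structure used in the proof: it replaces the $O(\log\log n)$ predecessor data structure with a plain binary search from Python's `bisect` module, it ignores the BWT interval components of the tuple, and the record `ClassNode` with its fields is hypothetical.

```python
from bisect import bisect_right


class ClassNode:
    """Hypothetical CDAWG node record: one part of the equivalence class
    per in-neighbor, stored sorted by the shortest string in the part."""

    def __init__(self, name, parts):
        # parts: iterable of (shortest string length in the part,
        #                     in-neighbor name, length of the arc label)
        parts = sorted(parts, key=lambda p: p[0])
        self.name = name
        self.boundaries = [p[0] for p in parts]
        self.in_neighbors = [p[1] for p in parts]
        self.arc_lengths = [p[2] for p in parts]


def parent(node, depth):
    """Return (in-neighbor, string depth of the parent) for the string of
    length `depth` that belongs to the equivalence class of `node`."""
    k = bisect_right(node.boundaries, depth) - 1  # predecessor of depth
    if k < 0:
        raise ValueError("depth is smaller than every string in the class")
    # The parent is obtained by removing the arc label from the right.
    return node.in_neighbors[k], depth - node.arc_lengths[k]
```

For the class of `ana` in `banana#`, built as `ClassNode("ana", [(3, "a", 2), (2, "", 2)])`, a query with depth 3 returns the in-neighbor `a` with parent depth 1, and a query with depth 2 returns the source node with parent depth 0.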

We read the label of an edge $\gamma$ of $\mathrm{ST}_T$ in $O(\log\log n)$ time per character (operation edgeChar), by storing
$\mathrm{RLBWT}_{\overline{T}}$ and the interval in $\mathrm{BWT}_{\overline{T}}$ of the reverse of the maximal repeat that corresponds to every node
of the CDAWG. By removing from $\mathrm{id}(v)$ the interval of $\ell(v)$ in $\mathrm{BWT}_T$, we can implement stringDepth,
child, firstChild and suffixLink in constant time, and parent and nextSibling in $O(\log\log n)$ time.

□

Corollary 1. Let $T \in [1..\sigma]^{n-1}$ be a string. There is an implementation of $\mathrm{ST}_T$ that takes $O(e_T + e_{\overline{T}})$ words
of space, that computes the matching statistics of a pattern $S \in [1..\sigma]^m$ with respect to $T$ in $O(m \log\log n)$
time, and that can be traversed in $O(n \log\log n)$ time and in a constant number of words of space.

Proof. We combine the implementation in the first row of Table [?] with the folklore algorithm for matching
statistics, that issues suffixLink and child operations on $\mathrm{ST}_T$, and that reads the label of some edges of
$\mathrm{ST}_T$. For traversal, we combine the implementation in the second row of Table [?] with the folklore algorithm
that issues just firstChild, parent and nextSibling operations. □
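
A stackless traversal that uses only these three operations can be sketched as follows. The function `traverse` and its callback parameters are our own hypothetical names: each callback is assumed to return `None` when the requested node does not exist, and any tree representation that offers the three operations can be plugged in.

```python
def traverse(root, first_child, next_sibling, parent):
    """Yield the nodes of the subtree of `root` in preorder, keeping only a
    constant number of node identifiers in memory (no stack, no recursion)."""
    node = root
    while node is not None:
        yield node
        child = first_child(node)
        if child is not None:
            node = child  # descend to the leftmost child
            continue
        # Climb until we reach a node that still has a sibling to visit.
        while node is not None and node != root and next_sibling(node) is None:
            node = parent(node)
        if node is None or node == root:
            return  # every node of the subtree has been reported
        node = next_sibling(node)
```

The working space is a constant number of node identifiers. With the tuple representation of the proof above, each identifier takes a constant number of words.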

By storing $\mathrm{RLBWT}_{\overline{T}}$ in addition to $\mathrm{RLBWT}_T$, and by adding to $\mathrm{id}(v)$ the interval of $\ell(v)$ in $\mathrm{BWT}_{\overline{T}}$, we
can also implement a bidirectional index on $T$ like those described in [5], that supports the left and right
extension of a string with any character in $O(\log\log n)$ time and that takes $O(e_T + e_{\overline{T}})$ words of space.
Appendix

Lower bound on the number of arcs in the CDAWG 

Lemma 4. The number of arcs in $\mathrm{CDAWG}_T$ is $\Omega(\log n)$ for any string $T \in [1..\sigma]^{n-1}$ and any $\sigma < n - 1$.

Proof. $\mathrm{CDAWG}_T$ must contain a path from the source to the node that corresponds to every suffix of $T$, and
since there are $n$ such paths, we need $\log n$ bits to discriminate at least one of these paths from the others. If
$\sigma = 2$, every node of the CDAWG has exactly two outgoing arcs, thus there must be a path from the source
to the node associated with a suffix of $T$ that has length at least $\log n$. If $\sigma > 2$, we can transform $\mathrm{CDAWG}_T$
into a DAG with degree at most two by multiplying the number of nodes and arcs by a factor of at most
two. Indeed, if a node $v$ has outdegree $k$, we can replace the arcs that start from $v$ with a tree rooted at $v$
whose leaves are the original destinations of the arcs from $v$: this tree has $k - 2$ additional nodes and $2k - 2$
arcs. The DAG that results from this transformation must have at least $\log n$ arcs, thus the number of arcs
in $\mathrm{CDAWG}_T$ is $\Omega(\log n)$. □

The same proof clearly holds for left extensions of maximal repeats, using $\mathrm{CDAWG}_{\overline{T}}$ rather than $\mathrm{CDAWG}_T$.
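
The degree-reduction step in the proof of Lemma 4 can be made concrete with a short Python sketch. The function name `binarize`, the dictionary representation of the DAG, and the choice of a path-shaped (caterpillar) binary tree are our own assumptions; any binary tree with the original destinations as leaves would give the same counts.

```python
def binarize(out_arcs):
    """Given a dict that maps each node to the list of destinations of its
    outgoing arcs, return an equivalent dict in which every node has at most
    two outgoing arcs. A node with k > 2 arcs is replaced by a binary tree
    rooted at it, with k - 2 auxiliary nodes and 2k - 2 arcs in total."""
    result = {}
    for v, dests in out_arcs.items():
        current, pending = v, list(dests)
        fresh = 0
        while len(pending) > 2:
            helper = ("aux", v, fresh)  # auxiliary node, unique per (v, fresh)
            fresh += 1
            # One arc goes to an original destination, one to the next helper.
            result[current] = [pending.pop(0), helper]
            current = helper
        result[current] = pending  # the last tree node keeps at most two arcs
    return result
```

For a node with five outgoing arcs, the transformation adds three auxiliary nodes and produces eight arcs, which matches the $k - 2$ and $2k - 2$ counts used in the proof.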

Proof of Lemma [T] 

Proof. Let $v$ be an internal node of $\mathrm{ST}_T$ such that $\ell(v)$ is a maximal repeat of $T$, and let $v'$ be the internal
node of $\mathrm{ST}_{\overline{T}}$ such that $\ell(v') = \overline{\ell(v)}$. Then, for every edge $(v,w) \in$ [?] in $\mathrm{ST}_T$ such that $v = \mathrm{parent}(w)$ there
is an implicit Weiner link from $v'$ in $\mathrm{ST}_{\overline{T}}$ labeled by the first character of $\ell(v,w)$. Conversely, an implicit
Weiner link labeled by character $b \in [0..\sigma]$ from any internal node $v'$ of $\mathrm{ST}_{\overline{T}}$ implies that
$1 = |\Sigma^{r}(b\,\ell(v'))| < |\Sigma^{r}(\ell(v'))|$, therefore it must be that $|\Sigma^{\ell}(\ell(v'))| > 1$. It follows that $\overline{\ell(v')}$ is a maximal
repeat of $T$, thus there is a node $v$ in $\mathrm{ST}_T$ with $\ell(v) = \overline{\ell(v')}$, and $b$ is the first character of the label of an
edge $(v,w) \in$ [?] such that $v = \mathrm{parent}(w)$.

Similarly, for every edge $(v,w) \in$ [?] such that $v = \mathrm{parent}(w)$ there is an explicit Weiner link from $v'$
in $\mathrm{ST}_{\overline{T}}$ labeled by the first character of $\ell(v,w)$. Conversely, an explicit Weiner link labeled by character
$b \in [0..\sigma]$ from any internal node $v'$ of $\mathrm{ST}_{\overline{T}}$ with at least two Weiner links implies that string $\ell(v')$ is a maximal
repeat, and that there is an edge $(v,w) \in$ [?] such that $\ell(v) = \overline{\ell(v')}$, $v = \mathrm{parent}(w)$, and $\ell(v,w) = bV$ for
some $V \in [0..\sigma]^*$. □

Lemma [?] immediately implies that the strings in [?] label internal nodes of $\mathrm{ST}_T$ that are not the
destination of any suffix link. However, there can be internal nodes of $\mathrm{ST}_T$ that are not the destination of
any suffix link but that are not maximal repeats.